We perceive our surrounding environment by using different sense organs. However, 
it is not clear how the brain estimates information from our surroundings from the 
multisensory stimuli it receives. While Bayesian inference provides a normative account 
of the computational principle at work in the brain, it does not provide information on how 
the nervous system actually implements the computation. To provide an insight into how 
the neural dynamics are related to multisensory integration, we constructed a recurrent 
network model that can implement computations related to multisensory integration. Our 
model not only extracts information from noisy neural activity patterns, it also estimates a 
causal structure; i.e., it can infer whether the different stimuli came from the same source 
or different sources. We show that our model can reproduce the results of psychophysical 
experiments on spatial unity and localization bias which indicate that a shift occurs in the 
perceived position of a stimulus through the effect of another simultaneous stimulus. The 
experimental data have been reproduced in previous studies using Bayesian models. By 
comparing the Bayesian model and our neural network model, we investigated how the 
Bayesian prior is represented in neural circuits. 

Keywords: causality inference, multisensory integration, spatial orientation, recurrent neural network,
Mexican-hat type interaction



1. INTRODUCTION 

We are surrounded by many sources of sensory stimulation, i.e., many sights and sounds.
Nevertheless, we can recognize who is speaking in a conversation even when there are many
people and sounds around us. To perform such recognition, we have to integrate the correct
pairs of stimuli: the movements of a person's mouth and the sound of his/her voice. Thus, it is
important to determine how we judge which pairs of audiovisual stimuli are related and how we
integrate related cues. That is, we must study multisensory integration in order to elucidate
how our brains link multiple sources of information. A good example of audiovisual integration
is the ventriloquism effect, in which the perceived location of a ventriloquist's voice is
altered by the movement of a dummy's mouth (Howard and Templeton, 1966). The ventriloquism
effect can also be elicited under artificial experimental conditions, for example with a spot
of light and a beep (Bertelson and Aschersleben, 1998; Pavani et al., 2000; Lewald et al., 2001;
Hairston et al., 2003; Alais and Burr, 2004; Wallace et al., 2004). Several theoretical models
based on Bayesian inference have been proposed to explain the data from psychophysical
experiments on the ventriloquism effect (Körding et al., 2007; Sato et al., 2007). Although
Bayesian inference gives a normative account of the computational principle, it does not
indicate how the nervous system actually implements the computation.

To provide insights into the neural dynamics related to sensory integration, several studies
have constructed neural network models that implement Bayesian inference (Pouget et al., 1998;
Ma et al., 2006). When stimuli have a common cause, these models are able to extract encoded
information from the activities of large populations of neurons as reliably as a maximum
likelihood estimator (Deneve et al., 1999; Latham et al., 2003). However, when stimuli have
distinct sources, the models cannot work correctly because they bind cues even when the stimuli
do not have the same source. When the stimuli have distinct causes, the brain has to estimate
the causal structure of the stimuli and extract information separately from each stimulus. We
constructed a recurrent network model that can implement computations related to multisensory
integration by changing the method of divisive normalization in the model of Deneve et al.
(1999). We found that our model could estimate not only the locations of the sources of the
stimuli but also the number of sources. By using computer simulation, we showed that the model
accounts for the data of psychophysical experiments that have been explained by the Bayesian
model. To elucidate how our brains implement a Bayesian prior distribution, we tried to
determine which neural connectivities represent the prior distribution.

2. MODEL 

We constructed a single-layer recurrent network consisting of N = 1000 analog neurons with
identical spatial receptive fields. Here, we label a neuron, i, by an angle $\theta_i$ and
express the firing rate as a function of $\theta_i$; therefore, a neural state, $u_i$, describes
the firing rate of the neuron population (including both excitatory and inhibitory neurons)
with the preferred angle $\theta_i$. In order to reduce the number of parameters and facilitate
analysis of the system's behavior, we study a simpler model in which the excitatory and
inhibitory populations are collapsed into a single equivalent population. To model a cortical
hypercolumn consisting of a single layer of neurons, we assumed that the preferred orientations
are evenly distributed from -50 to 50 deg and divided the 100 deg into N = 1000 sections, that
is, $\theta_i = 0.1 \times i - 50$ deg. The neural state, $u_i$, is determined by the inputs,
$a_i$, as

$$
a_i(t) = h_i + \sum_{j=1}^{N} J_{ij} u_j(t), \tag{1}
$$

where $h_i$ represents an external input and the second term on the right-hand side of the
equation represents a recurrent input. Using $a_i$, we defined the firing rate $u_i$ as



$$
u_i(t+1) = \frac{[a_i(t)]_+}{1 + \frac{1}{N} \sum_{j=1}^{N} [a_j(t)]_+}. \tag{2}
$$



cortex (Shadlen and Newsome, 1994). The standard deviations $\sigma_V$ and $\sigma_A$
respectively represent the uncertainties of the visual and audio input. Note that the strength
of the input activity, $M/(\sqrt{2\pi}\sigma)$, is determined not only by $M$ but also by the
uncertainty of the input, $\sigma$, in our model. We assumed that the visual input is more
reliable than the audio input. To investigate the effect of the difference in uncertainty
between visual and audio input, we fixed $M_V = M_A = 10$, $\sigma_V = 1$ [deg], and
$\sigma_A = 2$ [deg]. Thus, the input strength of the visual input is larger than that of the
audio input, i.e., $M_V/(\sqrt{2\pi}\sigma_V) > M_A/(\sqrt{2\pi}\sigma_A)$. An example of
external input to the network is given in Figure 2.

Now let us explain $x_V$ and $x_A$ in Equation 5. $x_V$ and $x_A$ represent the input locations
of the audiovisual stimuli. We assume that the audio and visual stimuli are Gaussian
distributed:

$$
x_V \sim \mathcal{N}(S_V, \sigma_{Vx}^2), \quad x_A \sim \mathcal{N}(S_A, \sigma_{Ax}^2). \tag{6}
$$



To keep $u_i$ positive, we used the threshold linear function ($[a_i]_+ = a_i$ if $a_i > 0$,
$[a_i]_+ = 0$ if $a_i \le 0$). To control the gain of the firing rate, we used divisive
normalization (Carandini et al., 1997). The interaction in the network turns noisy input into a
smooth hill shape. The peak position of the hill gives an estimate of the orientation. In a
previous study, Deneve et al. (1999) defined the function $u_i(t)$ in terms of the square of
the input $a_i$:

$$
u_i(t+1) = \frac{a_i(t)^2}{1 + \frac{1}{N} \sum_{j=1}^{N} a_j(t)^2}. \tag{3}
$$



In order to collapse the excitatory and the inhibitory populations into a single equivalent
population, we assumed that the synaptic weight, $J_{ij}$, has a Mexican-hat-type connectivity:
excitation is given to nearby neurons and inhibition to distant neurons (Figure 1; Amari, 1977;
Shadlen et al., 1996). We defined:



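$$
J_{ij} = \frac{M_1}{\sqrt{2\pi}\sigma_1} \exp\left[-\frac{(\theta_i - \theta_j)^2}{2\sigma_1^2}\right]
       - \frac{M_2}{\sqrt{2\pi}\sigma_2} \exp\left[-\frac{(\theta_i - \theta_j)^2}{2\sigma_2^2}\right]. \tag{4}
$$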

The parameters $\sigma_1$ and $\sigma_2$ respectively define the range of the excitatory
connection and of the lateral inhibition. Here, we set $M_1 = 28$, $M_2 = 10$,
$\sigma_1 = 1.5$ [deg], and $\sigma_2 = 3$ [deg]. Two features of our model, i.e., weak
normalization and lateral inhibition, distinguish it from Deneve's model, and they enable our
model to reproduce the results of psychophysical experiments (as discussed in the Results).
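As a quick numerical check of the connectivity with the parameter values above, the following worked calculation may help. It is only a sketch: it assumes that the weight in Equation 4 is the difference of two Gaussians with prefactors $M_k/(\sqrt{2\pi}\sigma_k)$, and the symbol $d^*$ (the distance between preferred angles at which the weight changes sign) is introduced here purely for illustration.

```latex
\begin{aligned}
J_{ii} &= \frac{M_1}{\sqrt{2\pi}\,\sigma_1} - \frac{M_2}{\sqrt{2\pi}\,\sigma_2}
        = \frac{28}{\sqrt{2\pi}\cdot 1.5} - \frac{10}{\sqrt{2\pi}\cdot 3}
        \approx 7.45 - 1.33 \approx 6.12, \\
J(d^*) = 0 \;&\Rightarrow\;
(d^*)^2 = \frac{\ln\!\left(\dfrac{M_1 \sigma_2}{M_2 \sigma_1}\right)}
               {\dfrac{1}{2\sigma_1^2} - \dfrac{1}{2\sigma_2^2}}
        = \frac{\ln 5.6}{1/6} \approx 10.3,
\qquad d^* \approx 3.2\ \text{deg}.
\end{aligned}
```

Under this assumption, neurons whose preferred angles differ by less than roughly 3.2 deg excite each other, and neurons further apart inhibit each other.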

Let us consider an external input, $h$, from either a preceding layer or from the external
world. The external input of neuron i, $h_i$, depends on the orientation encoded in the
previous layer and is Gaussian distributed with mean $\langle h_i \rangle$ and variance
$\sigma^2$. We define



$$
h_i = \frac{M_V}{\sqrt{2\pi}\sigma_V} \exp\left[-\frac{(\theta_i - x_V)^2}{2\sigma_V^2}\right]
    + \frac{M_A}{\sqrt{2\pi}\sigma_A} \exp\left[-\frac{(\theta_i - x_A)^2}{2\sigma_A^2}\right]
    + z_i, \tag{5}
$$



where $z_i$ denotes noise. We set $\sigma^2$ to the mean activity, i.e.,
$\sigma^2 = \langle h_i \rangle$, which better approximates the noise measured in the
FIGURE 1 | Mexican-hat-type connectivity: excitations are given to nearby neurons, inhibitions
to distant neurons.

FIGURE 2 | Example of input to the network (broken line) and mean input (solid line).

FIGURE 3 | Output of network model. (A) Common cause; (B) independent cause.



We fix $\sigma_{Vx} = 3$ [deg] and $\sigma_{Ax} = 6.5$ [deg]. Here, $\mathcal{N}(\mu, \sigma^2)$
means a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$. The external
input, $h_i$, is given in the initial five steps ($0 \le t \le 4$). The noisy input, $h_i$,
determines the initial state of $u_i$ [$a_i(0) = h_i$]. Because of the recurrent connections and
neural dynamics (see Equation 1 and Equation 2), the noisy neural states become a smooth hill
whose peak indicates the estimated position of the audiovisual stimuli (Figure 3).
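The dynamics described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' simulation code: the function names, the total number of iterations (`n_steps`), and the random seed are hypothetical, and the factor 1/N in the normalization term as well as the additive form of the noise are assumptions about Equations 2 and 5.

```python
import numpy as np

N = 1000
THETA = 0.1 * np.arange(1, N + 1) - 50.0  # preferred angles theta_i [deg]


def gaussian(d, m, s):
    """Gaussian profile with area m and standard deviation s."""
    return m / (np.sqrt(2.0 * np.pi) * s) * np.exp(-d ** 2 / (2.0 * s ** 2))


def mexican_hat(theta, m1=28.0, m2=10.0, s1=1.5, s2=3.0):
    """Recurrent weights J_ij: narrow excitation minus broad inhibition."""
    d = theta[:, None] - theta[None, :]
    return gaussian(d, m1, s1) - gaussian(d, m2, s2)


def mean_input(theta, x_v, x_a, m_v=10.0, m_a=10.0, s_v=1.0, s_a=2.0):
    """Mean external input <h_i>: a visual bump plus an auditory bump."""
    return gaussian(theta - x_v, m_v, s_v) + gaussian(theta - x_a, m_a, s_a)


def simulate(x_v, x_a, n_steps=50, input_steps=5, seed=0):
    """Iterate the network and return the final firing rates u_i."""
    rng = np.random.default_rng(seed)
    J = mexican_hat(THETA)
    h_mean = mean_input(THETA, x_v, x_a)
    u = np.zeros(N)
    for t in range(n_steps):
        if t < input_steps:
            # Noisy input with variance equal to the mean activity.
            h = h_mean + rng.normal(0.0, np.sqrt(h_mean))
        else:
            h = 0.0  # external input is switched off after the first steps
        a = h + J @ u                  # total input (external + recurrent)
        r = np.maximum(a, 0.0)         # threshold-linear function [a]_+
        u = r / (1.0 + r.mean())       # ASSUMPTION: normalization uses (1/N) * sum
    return u
```

Calling `simulate(x_v=0.0, x_a=5.0)` returns the activity profile over `THETA`; its peaks can then be located, for instance with `np.argmax`. Whether this sketch reproduces the one-bump and two-bump states of Figure 3 depends on the assumed normalization constant and has not been checked against the original simulations.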

3. RESULTS 

By using computer simulations, we showed that our network model can estimate the position(s) of
the sources of audiovisual stimuli with a disparity between the stimuli. We found that while
previous models could not reproduce the psychophysical experiments on audiovisual integration,
our results are consistent with both experimental observations and Bayesian inference. If the
disparity of the input stimuli was small ($x_A - x_V = 5$ [deg]), the stimuli were integrated at
a high rate (about 70%) (Figure 3A). If the disparity was large, they were estimated as distinct
stimuli (Figure 3B), something which could not be reproduced in previous models where the
normalization term in Equation 2 is determined by the square sum (Deneve et al., 1999). We found
that in the previous network model, the stimuli were estimated not as distinct stimuli but as a
united stimulus for any spatial disparity. The failure of Deneve's model to reproduce the
phenomenon is partly the result of the strong divisive normalization it uses (Equation 2),
because strong divisive normalization prunes weak multiple input peaks and extracts the maximum
peak. Another reason for the failure is the lack of lateral inhibition between neurons. Figure 4
compares the models in the case of independent causes. As in Deneve's model, in a weak
normalization model without lateral inhibition, the stimuli were estimated not as distinct
stimuli but as a united stimulus for any spatial disparity, as shown in Figure 4B (Marti et al.,
2013). Thus, both weak normalization and lateral inhibition in our model are important for
reproducing the results of the psychophysical experiments on audiovisual integration.

FIGURE 5 | Frequency of the network model estimating stimuli having a common cause. (A)
Proportion of unity with sensory noise [$x_V \sim \mathcal{N}(S_V, \sigma_{Vx}^2)$,
$x_A \sim \mathcal{N}(S_A, \sigma_{Ax}^2)$]. (B) Proportion of unity without noise
($S_V = x_V$, $S_A = x_A$).

3.1. EFFECT OF SENSORY NOISE 

We assumed that the information about the orientations of the audiovisual stimuli from the sense
organs, $x_V$ and $x_A$, is corrupted by sensory noise. This noise makes the output
probabilistic (Figure 5A). Without noise, the number of sources would be completely determined
by the spatial disparity D (Figure 5B). Experiments have shown that people estimate the number
of sources stochastically (Wallace et al., 2004).

3.2. BIAS 

Psychophysical experiments have shown that when audiovisual stimuli were judged to be distinct
stimuli, the estimated position of the auditory stimulus was displaced from the actual position
of the auditory input (Wallace et al., 2004). To examine how the perception of common versus
distinct causes affects the estimation of the auditory stimulus position, $S_A$, we calculated
the localization bias.



We performed 500 simulations and averaged the localization bias for each disparity between the
audiovisual stimuli and for each case, i.e., common and distinct. We compared our model with the
previous model of Deneve et al. (1999). In the previous model, the stimuli were unified for any
spatial disparity (Figure 6A). The value of the localization bias was nearly 100% for all
spatial disparities. This means that their model treated the audio stimulus as noise. Our model
stochastically judged whether the stimuli had a common cause or distinct causes (Figure 6B).
When two stimuli were unified, the localization bias was nearly 80%. This indicates that when
there was a common cause, the estimated auditory stimulus was at a position that was on average
very close to that of the visual stimulus. On the other hand, in the case of distinct causes,
the localization bias took a negative value and was increasingly negative for smaller
disparities. These results indicate that the estimated auditory position is pushed away from
the location of the visual stimulus, as was experimentally observed (Wallace et al., 2004).

FIGURE 4 | Model comparison in the case of independent cause. (A) Our network model; (B) weak
normalization model without lateral inhibition.
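The definition of the localization bias is not reproduced in the text above. A common convention, consistent with the percentages quoted in this section (100% when the auditory estimate coincides with the visual source, negative values when it moves away from it), is the shift of the auditory estimate toward the visual source expressed as a percentage of the disparity. The sketch below assumes that convention; the function and argument names are hypothetical.

```python
def localization_bias(est_a, s_a, s_v):
    """Percent shift of the auditory estimate toward the visual source.

    ASSUMED definition: 100 * (est_a - s_a) / (s_v - s_a).
    """
    disparity = s_v - s_a
    if disparity == 0:
        raise ValueError("bias is undefined when the two sources coincide")
    return 100.0 * (est_a - s_a) / disparity


def mean_bias(estimates, s_a, s_v):
    """Average bias over repeated trials (e.g., 500 simulations)."""
    values = [localization_bias(e, s_a, s_v) for e in estimates]
    return sum(values) / len(values)
```

With this convention, an auditory estimate that lands on the opposite side of the true auditory position from the visual stimulus yields a negative bias, which matches the description of the distinct-cause case.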

4. BAYESIAN PRIOR IN A NEURAL NETWORK 

Bayesian inference is a method of reasoning that combines prior knowledge about the world with
current input data. More precisely, from experience we may learn how likely two co-occurring
signals (visual and auditory signals) are to have a common cause versus two independent causes.
Using this Bayesian prior, a Bayesian inference model integrates those pieces of information to
estimate whether there is a common cause and to estimate the positions of the cues. Previous
studies have reported that Bayesian inference can explain the pattern of localization bias
reproduced by our model (Figure 6C) (Körding et al., 2007; Sato et al., 2007). Given that our
neural network model and the Bayesian model can explain the same psychophysical experiment,
there should be a neural connection in our model that represents the prior information. We
searched for the parameter of our network model that corresponds to the prior information on
the likelihood of sensory integration.

4.1. MULTISENSORY INTEGRATION IN THE NEURAL NETWORK 

To simplify the comparison between the network model and the Bayesian model, let us consider a
case in which we receive sensory inputs without noise ($x_V = S_V$, $x_A = S_A$). The distance
between the audiovisual stimuli, D, determines the causal structure in this case (Figure 5B),
and we can determine the integration threshold $D_0^{net}$ (the distance within which the
auditory and visual signals are integrated). Once $D_0^{net}$ is determined, we can calculate
the proportion of integration with noise as follows. When the distance between the audiovisual
inputs, $x_V - x_A = D_{input}$, is lower than $D_0^{net}$, the stimuli are integrated.
$D_{input}$ is drawn from a normal distribution with mean $S_V - S_A = D$, which is the distance
between the original positions of the audiovisual stimuli, and standard deviation
$\sqrt{\sigma_{Vx}^2 + \sigma_{Ax}^2}$, which combines the auditory and visual noise. Using
$D_0^{net}$, we obtain the proportion of integration as a function of D,



signals are integrated in the Bayesian view) as follows (Körding et al., 2007). We determine
whether the stimuli originate from the same source (C = 1) or two sources (C = 2). The perceived
locations of the audiovisual stimuli, $x_V$ and $x_A$, are shifted from their original positions
by Gaussian noise with standard deviations $\sigma_V$ and $\sigma_A$. Accordingly, we calculate
the probability of C = 1 using Bayes' theorem (Körding et al., 2007):



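$$
p(C = 1 \mid x_V, x_A) = \frac{p(x_V, x_A \mid C = 1)\, p_{co}}{p(x_V, x_A)}
= \frac{p(x_V, x_A \mid C = 1)\, p_{co}}{p(x_V, x_A \mid C = 1)\, p_{co} + p(x_V, x_A \mid C = 2)\, (1 - p_{co})},
$$

where $p_{co} = p(C = 1)$ is the prior probability of a common cause.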

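When the source locations of the audiovisual signals are uniformly distributed in the spatial
range $[-a/2, a/2]$, we obtain

$$
p(C = 1 \mid x_V, x_A) = \frac{a\, q(D)\, p_{co}}{a\, q(D)\, p_{co} + 1 - p_{co}}, \tag{8}
$$

where

$$
q(D) = \frac{1}{\sqrt{2\pi(\sigma_V^2 + \sigma_A^2)}} \exp\left[-\frac{D^2}{2(\sigma_V^2 + \sigma_A^2)}\right]. \tag{9}
$$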



We assume that the Bayesian model reports the same source when
$p(C = 1 \mid x_V, x_A) > p(C = 2 \mid x_V, x_A)$. We define $D_0^{Bay}$ as the distance D that
satisfies $p(C = 1 \mid x_V, x_A) = p(C = 2 \mid x_V, x_A)$. As shown in Equation 8, the
Bayesian prior $p_{co}$ affects the judgment of unity. We investigated how $p_{co}$ affects
$D_0^{Bay}$ (Figure 7).
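The Bayesian threshold can be written in closed form by setting the posterior of Equation 8 to 0.5, i.e., $a\,q(D)\,p_{co} = 1 - p_{co}$, and solving for D. The sketch below does this with the standard library only. The function names are hypothetical, and the width `a` of the uniform spatial range is not stated in the text; the value of 100 deg used in the usage note is an assumption chosen to match the range of preferred angles.

```python
import math


def q(d, sigma_v, sigma_a):
    """Equation 9: Gaussian in the disparity d with variance sigma_v^2 + sigma_a^2."""
    s2 = sigma_v ** 2 + sigma_a ** 2
    return math.exp(-d ** 2 / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)


def posterior_common(d, p_co, sigma_v, sigma_a, a):
    """Equation 8: posterior probability of a common cause."""
    num = a * q(d, sigma_v, sigma_a) * p_co
    return num / (num + 1.0 - p_co)


def bayes_threshold(p_co, sigma_v, sigma_a, a):
    """Disparity at which the posterior equals 0.5 (0.0 if it never reaches 0.5).

    Assumes 0 < p_co < 1 and a > 0.
    """
    s2 = sigma_v ** 2 + sigma_a ** 2
    # From a * q(D) * p_co = 1 - p_co, solved for D.
    ratio = (1.0 - p_co) * math.sqrt(2.0 * math.pi * s2) / (a * p_co)
    if ratio >= 1.0:
        return 0.0  # even at zero disparity, two causes are at least as likely
    return math.sqrt(-2.0 * s2 * math.log(ratio))
```

For example, `bayes_threshold(0.2, 3.0, 6.5, 100.0)` gives a threshold of roughly 5.8 deg under the assumed range, and the threshold grows as `p_co` increases, in line with the dependence on the prior described above.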

When the causal structure is defined, we can calculate the optimal estimate of the stimulus
position for the cases of C = 1 and C = 2. When the audiovisual stimuli have independent causes,
the optimal solutions are

$$
x_{V,C=2} = x_V, \quad x_{A,C=2} = x_A. \tag{10}
$$



When the audiovisual stimuli have a common cause, the optimal solution is

$$
x_{V,C=1} = x_{A,C=1} = \frac{x_V/\sigma_V^2 + x_A/\sigma_A^2}{1/\sigma_V^2 + 1/\sigma_A^2}. \tag{11}
$$
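The common-cause estimate is a reliability-weighted average of the two observations. The short sketch below, with hypothetical function names, assumes the standard inverse-variance weighting and shows how strongly the visual observation dominates for given noise levels.

```python
def fused_estimate(x_v, x_a, sigma_v, sigma_a):
    """Inverse-variance weighted average of the visual and auditory observations."""
    w_v = 1.0 / sigma_v ** 2
    w_a = 1.0 / sigma_a ** 2
    return (w_v * x_v + w_a * x_a) / (w_v + w_a)


def visual_weight(sigma_v, sigma_a):
    """Fraction of the fused estimate contributed by the visual observation."""
    return sigma_a ** 2 / (sigma_v ** 2 + sigma_a ** 2)
```

With $\sigma_V = 3$ deg and $\sigma_A = 6.5$ deg, the visual weight is 42.25/51.25, about 0.82, so the fused estimate lies much closer to the visual observation than to the auditory one.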



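$$
I(D) = \int_{-D_0^{net}}^{D_0^{net}} \frac{1}{\sqrt{2\pi(\sigma_{Vx}^2 + \sigma_{Ax}^2)}}
\exp\left[-\frac{(l - D)^2}{2(\sigma_{Vx}^2 + \sigma_{Ax}^2)}\right] dl. \tag{7}
$$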



$D_0^{net}$ determines the likelihood of sensory integration. We investigated the relationship
between the parameters of the neural connection $J_{ij}$ and the Bayesian prior distribution in
terms of the integration threshold.
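Because the noisy disparity is Gaussian, the proportion of integration can be evaluated with the normal cumulative distribution function instead of a numerical integral. The sketch below assumes that integration occurs whenever the absolute noisy disparity is below the threshold; the function names are hypothetical, and the threshold value passed in is a free parameter because its numerical value is read from the figures rather than stated in the text.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def proportion_integrated(d, d0, sigma_vx=3.0, sigma_ax=6.5):
    """Probability that the noisy disparity falls within [-d0, d0].

    d  : true disparity S_V - S_A [deg]
    d0 : integration threshold [deg] (hypothetical value supplied by the caller)
    """
    s = math.hypot(sigma_vx, sigma_ax)  # sqrt(sigma_vx^2 + sigma_ax^2)
    return normal_cdf((d0 - d) / s) - normal_cdf((-d0 - d) / s)
```

The result is symmetric in the sign of the disparity and decreases as the true disparity grows, which is the qualitative shape of the noisy proportion-of-unity curve.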

4.2. INTEGRATION THRESHOLD IN BAYESIAN MODEL 

Using the Bayesian approach, we can also calculate the integration threshold $D_0^{Bay}$
(distance within which auditory and visual


We calculated the localization bias using the Bayesian model (Figure 6C). Here, we fixed
$p_{co} = 0.2$, $\sigma_V = 3$ [deg], and $\sigma_A = 6.5$ [deg].

Both the Bayesian prior $p_{co}$ and the recurrent connectivity $J_{ij}$ affect the integration
threshold $D_0$. Thus, the integration threshold $D_0$ supports the idea that the Bayesian prior
$p_{co}$ corresponds to the recurrent connectivity $J_{ij}$ in the cortical neural network.

4.3. NETWORK CONNECTIVITY REPRESENTS BAYESIAN PRIOR $p_{co}$

Synaptic plasticity is thought to be the basic phenomenon underlying learning. It could be said
that a neural network learns a Bayesian prior by changing its connectivity. We investigated how
the parameters of the network connectivity $J_{ij}$ affect the integration threshold, as shown
in Figure 7. $D_0^{net}$ increases with the ratio between the strength of the excitatory
connection, $M_1$, and that of the inhibitory connection, $M_2$. It also increases approximately
with the range of the excitatory connection, $\sigma_1$, whereas it varies non-monotonically
with the range of lateral inhibition, $\sigma_2$, as illustrated in Figure 7D.

FIGURE 6 | Localization bias with spatial disparity D. Error bars represent SEMs. (A) Previous
model (Deneve et al., 1999). (B) Proposed model. (C) Bayesian model (Körding et al., 2007; Sato
et al., 2007). "Common cause" refers to the situation in which the model judged that two sensory
signals have a common cause (i.e., the network converged to a single bump state) and
"independent cause" refers to the situation in which the network judged that two sensory signals
have independent causes (i.e., the network converged to a state of two single bumps or the MAP
estimate of the Bayesian model corresponds to two sources). The negative bias indicates that the
perceived auditory position is on the opposite side of the true position with respect to the
position of the visual stimulus.

FIGURE 7 | Threshold of integration. (A) The thresholds of the Bayesian model are plotted
against the prior of sensory integration. (B) The thresholds of the network model are plotted
against the ratio of the strengths of the excitatory connection, $M_1$, and inhibitory
connection, $M_2$, of the recurrent network. (C) The thresholds of the network model are plotted
against the range of excitatory connection, $\sigma_1$. (D) The thresholds of the network model
are plotted against the range of lateral inhibition, $\sigma_2$.

Let us focus on the excitatory connection, which could be changed through Hebb's learning rule.
As shown in Figures 7A, B, $D_0^{net}$ increases with $M_1$ in the same way as $D_0^{Bay}$
increases with $p_{co}$. This means that the Bayesian prior $p_{co}$ is represented as $M_1$ in
the network model. This result suggests that the neural network achieves Bayesian inference
through learning appropriate prior information by adjusting the excitatory connection $M_1$.

5. DISCUSSION 

We constructed a recurrent network model that distinguishes whether audiovisual stimuli have a
common cause or distinct causes. We showed that our model not only estimates the number of
sources, but also reproduces the localization bias observed in psychophysical experiments
(Wallace et al., 2004). Previous studies have revealed that the Bayesian ideal observer model
could explain psychophysical data on sensory integration (Körding et al., 2007; Sato et al.,
2007). Although a Bayesian model gives a normative account of the computational principle, it
does not provide a neural implementation of optimal causal inference. Our model is a
biologically plausible model of cortical circuitry, and it provides information about how the
nervous system can implement the computation (Carandini et al., 1997). To reveal how the nervous
system implements Bayesian inference, we investigated the relationship between the synaptic
connections of the proposed model and the prior distribution in the Bayesian model. We found
that the strength of the excitatory connection represents the prior distribution for the
probability of integration.

Previous research has used divisive normalization of the firing rate as a gain control. The
network model extracted variables encoded by a population of noisy neurons (Deneve et al.,
1999). The neural activities converged to a smooth stable peak, and the position of the peak
depended on the variables. Therefore, the position could be used to estimate these quantities
in their model. Moreover, through proper tuning of the parameters, the model closely
approximated the maximum likelihood estimate, which would be used by an ideal observer in most
cases of interest. However, two or more localized activities could not coexist in the previous
network model. The model thus could not simultaneously estimate information about multiple
sources, which is needed for living in a natural environment. We found that strong divisive
normalization makes it hard for localized activities to coexist. Iteration of Equation 3 makes
the ratio of local excitations large, and eventually, only the largest one can survive. This
effect occurs if the exponent is greater than 1. We constructed a model in which an arbitrary
number of local excitations could coexist by making the exponent equal to one. This simple
normalization can be biologically implemented by a linear computation and shunting inhibition
(Carandini et al., 1997). Although our network may not achieve optimal inference for each
source position, it is biologically plausible and can reproduce the properties of
auditory-visual integration observed in psychophysical experiments. These results imply that
normalization with a threshold linear function is important in multisensory integration with
causal inference.
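The winner-take-all effect of an exponent greater than one can be illustrated with a simplified calculation. The sketch assumes, purely for illustration, two local excitations with positive inputs $a^{(1)} > a^{(2)}$, a normalization with exponent $n$ whose denominator is shared by all neurons, and an input at the next step that is proportional to the current output (lateral interactions between the two excitations are ignored). None of these simplifications is taken from the model definition.

```latex
\frac{u^{(1)}(t+1)}{u^{(2)}(t+1)}
  = \left(\frac{a^{(1)}(t)}{a^{(2)}(t)}\right)^{n}
\quad\Rightarrow\quad
\frac{u^{(1)}(T)}{u^{(2)}(T)}
  = \left(\frac{a^{(1)}(0)}{a^{(2)}(0)}\right)^{n^{T}} .
```

For $n = 2$ and an initial ratio of 1.1, five iterations already give $1.1^{32} \approx 21$, so the weaker excitation is rapidly suppressed; for $n = 1$ the ratio stays at 1.1 and both excitations can persist.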

We reproduced the results of psychophysical experiments showing localization bias in
audiovisual integration. Whenever stimuli were unified, the model estimated that the auditory
position shifted toward the location of the visual stimulus. This phenomenon is caused by the
difference in the reliability of the stimuli. That is, because visual information for source
localization is much more reliable than auditory information, vision dominates sound. Moreover,
it is also known that when an auditory signal is more reliable than a visual signal, sound
dominates vision (Alais and Burr, 2004). Localization bias has also been reported for other
cross-modal cues (Pavani et al., 2000). Our model represents the reliability of stimuli by the
strength of the input activity. It can be generalized to other types of cue integration by
changing the strength of the input activity.

The results of psychophysical experiments have been explained using Bayesian inference (Körding
et al., 2007; Sato et al., 2007). Bayesian inference is a method of reasoning that combines
prior knowledge with current input data. In our brains, information about the external world is
estimated on the basis of prior knowledge (Doya et al., 2007). However, until now, it was
unknown how prior knowledge can be represented in a neural circuit. We investigated how a neural
network can implement prior knowledge. Our results suggest that neural networks learn an
appropriate prior through synaptic plasticity.

In the Bayesian model, negative bias is assumed to be caused by sensory noise (Körding et al.,
2007; Sato et al., 2007). Stimuli are unified when the distance between the perceived locations
of the audiovisual stimuli, which are shifted from their original positions, is smaller than
$D_0^{Bay}$; when it is larger than $D_0^{Bay}$, the stimuli are not unified. The averaged bias
of the non-unified case takes on a negative value. In our neural network model, not only sensory
noise but also the interaction of localized activities has an effect on the negative bias.
Localized activities repel each other through the effect of the Mexican-hat type of connectivity
(Figure 3). This corresponds to implementing a prior distribution over the likely positions of
different input sources, which has not been implemented in the previous Bayesian models (Körding
et al., 2007; Sato et al., 2007). It is unclear where causal inference is performed in the
brain. If the repulsive effect were to be observed in a brain region that performs multisensory
integration, it would support the notion that our model is actually implemented in the brain.